Abstract

We study a bandit problem where observations from each arm have an exponential family 
distribution and different arms are assigned independent conjugate priors. At each of n 
stages, one arm is to be selected based on past observations. The goal is to find a strategy 
that maximizes the expected discounted sum of the n observations. Two structural results 
hold in broad generality: (i) for a fixed prior weight, an arm becomes more desirable as its 
prior mean increases; (ii) for a fixed prior mean, an arm becomes more desirable as its prior 
weight decreases. These generalize and unify several results in the literature concerning 
specific problems including Bernoulli and normal bandits. The second result captures an 
aspect of the exploration-exploitation dilemma in precise terms: given the same immediate 
payoff, the less one knows about an arm, the more desirable it becomes because there remains 
more information to be gained when selecting that arm. For Bernoulli and normal bandits 
we also obtain extensions to nonconjugate priors. 

Keywords: Bernoulli bandits; convex order; log-concavity; optimal stopping; sequential 
decision; two-armed bandits.

1 Introduction

At each of n stages, an experimenter must take an observation from one of two stochastic 
processes (arms). Let us adopt the Bayesian framework and assume that the experimenter's 
belief about an unknown arm is updated according to Bayes' theorem after each observation.
A strategy specifies which process to select at each stage. The objective is to maximize the
expected payoff, $E\left(\sum_{i=1}^n a_i Z_i\right)$, where $Z_i$ is the observation at stage $i$ and $A_n = (a_1, a_2, \ldots, a_n)$ is
a discount sequence satisfying $a_i \ge 0$ and $\sum_{i=1}^n a_i > 0$. A strategy is optimal if it achieves the
maximum expected payoff. This is a finite-horizon two-armed bandit (Berry and Fristedt 1985), 
a classical problem in sequential decision theory. 

Bernoulli bandits, where each arm generates binary observations, are important as a model 
for clinical trials, and have received considerable attention (Berry 1972; Berry and Fristedt 
1985). Others such as normal (Chernoff 1968; Chernoff and Petkau 1986; Yao 2006) and Dirichlet 
bandits (Clayton and Berry 1985; Yu 2011) have also been extensively studied. Bandit problems 
exhibit a well-known exploration-exploitation tradeoff. Simply maximizing the immediate payoff 
is usually not an optimal strategy; one must allow for exploring an unknown arm for higher payoff 
later on. From a Bayesian perspective, the optimal strategy is easily specified through backward 
induction, although its computation can be nontrivial. If the discount sequence is geometric, 
then the problem reduces to several one-armed bandits (Gittins and Jones 1974; Gittins 1979; 
Whittle 1980; Kaspi and Mandelbaum 1998) and the optimal strategy is to choose an arm with 
the highest dynamic allocation index, or Gittins index. Optimal strategies for general discount 
sequences are less tractable. 

The Gittins index possesses intriguing monotonicity properties with respect to prior specifications.
For example, Gittins and Wang (1992) show that the Gittins index decreases in $\tau > 0$
for some special bandit arms: a Bernoulli arm whose unknown parameter has a Beta($\tau s$, $\tau(1 - s)$)
prior ($0 < s < 1$), or a normal arm whose unknown mean has a N($\mu$, $1/\tau$) prior ($\mu \in \mathbb{R}$). In both
cases $\tau$ is naturally interpreted as the amount of prior information. Such monotonicity results
therefore capture an aspect of the exploration-exploitation dilemma in precise terms: given the 
same immediate payoff, the less one knows about an arm, the more desirable it becomes since 
there is more room for exploration. In the literature, however, this monotonicity is usually 
derived for one-armed bandits and on a case-by-case basis. This paper aims to obtain more 
general results in a unified framework. 

The Bernoulli and normal bandits can be regarded as special cases of a general bandit 
where observations from each arm have an exponential family distribution. Assume each arm 
is assigned an independent conjugate prior, which is characterized by a prior mean and a prior 
weight. The prior mean specifies the immediate payoff of an arm, whereas the prior weight 
reflects the associated uncertainty. For such problems we show that: (i) for fixed prior weight, the 
maximum expected payoff increases as the prior mean for any arm increases; (ii) for fixed prior 
mean, the maximum expected payoff increases as the prior weight for any arm decreases. These
generalize and unify several results in the literature concerning specific distributions. Similar
techniques yield parallel results for Dirichlet bandits, which do not fit in the one-parameter 
exponential family framework (Clayton and Berry 1985; Chattopadhyay 1994; Yu 2011). 

The rest of the paper is organized as follows. After setting up the exponential family framework
and introducing a few notions of stochastic ordering in Section 2, we present basic structural
results such as a stay-on-a-winner rule in Section 3. Section 4 contains the main results,
including monotonicity of the value function with respect to prior weights. Section 5 applies
the results in Section 4 to one-armed bandits. In particular, we show that the break-even value 
decreases as the prior weight of the unknown arm increases. In Sections 6 and 7 we extend 
the monotonicity results to nonconjugate priors for Bernoulli and normal bandits, respectively. 
Section 8 concludes with a brief discussion on an open problem. 

2 Preliminaries 

Let $\nu$ be a $\sigma$-finite measure on $\mathbb{R}$ that is not a point mass. Denote

$$\psi(\theta) = \log \int e^{\theta x}\, d\nu(x), \quad \theta \in \Theta,$$

where $\Theta$ is the natural parameter space defined as the set of $\theta \in \mathbb{R}$ such that $\psi(\theta)$ is finite. We
assume that $\Theta$ has a non-empty interior. Suppose that given $\theta_i$, observations from arm $i$ are
independent and identically distributed (i.i.d.) according to the density (relative to $\nu$)

$$f(x \mid \theta_i) = e^{\theta_i x - \psi(\theta_i)}. \tag{1}$$

Let us assume independent conjugate priors on $\theta_i$, $i = 1, 2$, with Lebesgue density

$$f(\theta_i \mid \gamma_i, \tau_i) \propto e^{\gamma_i \theta_i - \tau_i \psi(\theta_i)}, \quad \theta_i \in \Theta. \tag{2}$$

Let $\mathcal{K}$ denote the smallest open interval such that $\nu$ assigns no mass outside of the closure
$\bar{\mathcal{K}}$. To ensure that the priors are proper, we require $\tau_i > 0$ and $\gamma_i/\tau_i \in \mathcal{K}$ (Brown 1986,
Chapter 4). As usual $\tau_i$ is regarded as the "prior sample size" and $\gamma_i$ the "prior sum of observations".
We refer to (2) as the $(\gamma_i, \tau_i)$ prior and call this two-armed bandit with discount
sequence $A_n$ the $(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n)$ bandit. Its value (i.e., maximum expected payoff) is denoted
by $V(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n)$.

This framework unifies several well-studied bandit reward structures: (i) Bernoulli rewards
whose unknown parameter has a Beta($\gamma$, $\tau - \gamma$) prior; (ii) normal rewards whose unknown
mean has a N($\gamma/\tau$, $1/\tau$) prior; (iii) exponential rewards whose unknown rate parameter has a
Gamma($\tau + 1$, $\gamma$) prior; (iv) Poisson rewards whose unknown rate parameter has a Gamma($\gamma$, $\tau$)
prior. Extensions to general priors for (i) and (ii) are considered in Sections 6 and 7, respectively.
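
As a check on case (i), the following worked change of variables recovers the Beta prior from (2). It is a sketch that takes the base measure to be counting measure on $\{0, 1\}$ and writes $p$ for the success probability; both are assumptions made here for illustration, since the text leaves them implicit.

```latex
% Bernoulli rewards with counting measure on {0, 1} as the base measure:
\psi(\theta) = \log(1 + e^{\theta}), \qquad
p = \frac{e^{\theta}}{1 + e^{\theta}}, \qquad
\frac{d\theta}{dp} = \frac{1}{p(1-p)}.
% The prior (2) on \theta, rewritten in terms of p
% (using e^{\theta} = p/(1-p) and e^{-\psi(\theta)} = 1 - p):
e^{\gamma\theta - \tau\psi(\theta)} = p^{\gamma}(1-p)^{\tau-\gamma}.
% Multiplying by the Jacobian d\theta/dp gives the density of p:
f(p) \propto p^{\gamma-1}(1-p)^{\tau-\gamma-1}, \qquad 0 < p < 1,
% which is Beta(\gamma, \tau-\gamma): proper iff 0 < \gamma < \tau, with mean \gamma/\tau.
```

The properness condition $0 < \gamma < \tau$ obtained this way agrees with the requirement that the prior mean $\gamma/\tau$ lie in the open interval $(0, 1)$.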

Let $V^i(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n)$ be the expected payoff when selecting arm $i$ initially and using an
optimal strategy thereafter. Then

$$V(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n) = \max\left\{V^1(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n),\; V^2(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n)\right\}, \tag{3}$$

and it is optimal to start with the arm whose $V^i$ is larger. Suppose arm 1 is selected, resulting in
an observation $X$. By conjugacy, the posterior for $\theta_1$ is again of the form of (2) with $(\gamma_1 + X, \tau_1 + 1)$
in place of $(\gamma_1, \tau_1)$. Thus we have

$$V^1(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n) = a_1 \mu_1 + E\left[V(\gamma_1 + X, \tau_1 + 1; \gamma_2, \tau_2; A_n^1) \mid \gamma_1, \tau_1\right], \tag{4}$$
$$V^2(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n) = a_1 \mu_2 + E\left[V(\gamma_1, \tau_1; \gamma_2 + Y, \tau_2 + 1; A_n^1) \mid \gamma_2, \tau_2\right], \tag{5}$$

where $A_n^1 = (a_2, a_3, \ldots, a_n)$ and $\mu_i$ denotes the expected value of an observation from arm $i$
under the $(\gamma_i, \tau_i)$ prior. This $\mu_i$ is simply $\mu_i = \gamma_i/\tau_i$, which we refer to as the prior mean. In
$E[g(X) \mid \gamma_1, \tau_1]$, we use $X$ to denote a generic observation from arm 1 under the $(\gamma_1, \tau_1)$ prior;
similarly for $Y$. That is, the density of $X$ relative to $\nu$ is

$$f(x) \propto \int_{\Theta} e^{\theta(\gamma_1 + x) - (\tau_1 + 1)\psi(\theta)}\, d\theta. \tag{6}$$

The dynamic programming equations (3)-(5) are crucial for both theoretical analysis and numerical
computation of the optimal strategy.
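
To make the recursion concrete, here is a minimal sketch of backward induction through (3)-(5) for the Bernoulli case, where the $(\gamma_i, \tau_i)$ prior is Beta($\gamma_i$, $\tau_i - \gamma_i$) and the marginal probability of a success equals the current prior mean. The function name `bernoulli_bandit_value` and the state encoding by success and pull counts are illustrative choices, not part of the paper.

```python
from functools import lru_cache


def bernoulli_bandit_value(g1, t1, g2, t2, discounts):
    """Value V(g1, t1; g2, t2; A_n) of a two-armed Bernoulli bandit.

    Arm i starts from the (g_i, t_i) conjugate prior (assumed 0 < g_i < t_i),
    so its prior mean is g_i / t_i and an observation x in {0, 1} updates the
    prior to (g_i + x, t_i + 1).
    """
    a = tuple(discounts)
    n = len(a)

    @lru_cache(maxsize=None)
    def value(s1, m1, s2, m2):
        # s_i successes observed in m_i pulls of arm i; the stage is m1 + m2.
        stage = m1 + m2
        if stage == n:
            return 0.0
        mu1 = (g1 + s1) / (t1 + m1)  # current mean of arm 1
        mu2 = (g2 + s2) / (t2 + m2)  # current mean of arm 2
        # Equations (4) and (5): immediate payoff plus expected continuation.
        v1 = (a[stage] * mu1
              + mu1 * value(s1 + 1, m1 + 1, s2, m2)
              + (1 - mu1) * value(s1, m1 + 1, s2, m2))
        v2 = (a[stage] * mu2
              + mu2 * value(s1, m1, s2 + 1, m2 + 1)
              + (1 - mu2) * value(s1, m1, s2, m2 + 1))
        # Equation (3): start with whichever arm gives the larger payoff.
        return max(v1, v2)

    return value(0, 0, 0, 0)
```

For instance, `bernoulli_bandit_value(2, 4, 1, 2, [1.0, 1.0])` evaluates to 13/12 up to floating-point rounding. Memoizing on counts rather than on the posterior parameters keeps the state space finite even when the prior parameters are not integers.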

A key tool in our derivation is the notion of stochastic ordering (Müller and Stoyan 2002;
Shaked and Shanthikumar 2007). We shall use the usual stochastic order $\le_{\mathrm{st}}$, the convex order
$\le_{\mathrm{cx}}$, the likelihood ratio order $\le_{\mathrm{lr}}$, and the relative log-concavity order $\le_{\mathrm{lc}}$. For random variables
$Z_1$ and $Z_2$ taking values on $\mathbb{R}$, we write $Z_1 \le_{\mathrm{st}} Z_2$ (respectively, $Z_1 \le_{\mathrm{cx}} Z_2$), if $E\phi(Z_1) \le E\phi(Z_2)$
for every increasing (respectively, convex) function $\phi$ such that the expectations exist. If $Z_1 \le_{\mathrm{st}} Z_2$
then we also say $Z_2$ is to the right of $Z_1$. If $Z_1$ and $Z_2$ have densities $f_1(z)$ and $f_2(z)$
respectively, supported on the same interval, then we write $Z_1 \le_{\mathrm{lr}} Z_2$ (respectively, $Z_1 \le_{\mathrm{lc}} Z_2$) if
$\log[f_1(z)/f_2(z)]$ is decreasing (respectively, concave) in $z$. For example, the $(\gamma, \tau)$ prior increases
in the likelihood ratio order as $\gamma$ increases, and decreases in the relative log-concavity order as
$\tau$ increases. (We use $\le_{\mathrm{lr}}$, $\le_{\mathrm{st}}$, $\le_{\mathrm{lc}}$ and $\le_{\mathrm{cx}}$ with densities as well as random variables.) Useful
properties include the implication $\le_{\mathrm{lr}} \implies \le_{\mathrm{st}}$. Assuming equal means, it also holds that $\le_{\mathrm{lc}}$
implies $\le_{\mathrm{cx}}$. Intuitively, the relative log-concavity order compares the amount of information
as it is defined through curvatures of the log density functions. Both $\le_{\mathrm{lr}}$ and $\le_{\mathrm{lc}}$ are preserved
under the prior-to-posterior updating, which makes them ideal for studying structural properties
in bandit problems. The log-concavity order is also useful in other seemingly unrelated contexts
(Whitt 1985; Yu 2009a, 2009b, 2010).
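
The definitions above can be checked mechanically on small examples. The sketch below assumes a discrete analogue of the orders (probability mass functions on the common support $\{0, 1, \ldots, n\}$ in place of densities on an interval); the helper names are hypothetical. It tests monotonicity or concavity of the log ratio, and uses the stop-loss transform $E\max\{0, Z - b\}$ as a convex-order check for distributions with equal means.

```python
import math


def log_ratio(f1, f2):
    """log(f1[k] / f2[k]) for two positive pmfs on a common support."""
    return [math.log(p) - math.log(q) for p, q in zip(f1, f2)]


def is_lr_smaller(f1, f2, tol=1e-12):
    """True if f1 <=_lr f2, i.e. log(f1 / f2) is non-increasing."""
    r = log_ratio(f1, f2)
    return all(r[k + 1] <= r[k] + tol for k in range(len(r) - 1))


def is_lc_smaller(f1, f2, tol=1e-12):
    """True if f1 <=_lc f2, i.e. log(f1 / f2) has non-positive second differences."""
    r = log_ratio(f1, f2)
    return all(r[k + 1] - 2 * r[k] + r[k - 1] <= tol for k in range(1, len(r) - 1))


def stop_loss(pmf, b):
    """E max{0, Z - b} when Z has the given pmf on 0, 1, ..., len(pmf) - 1."""
    return sum(max(0.0, k - b) * p for k, p in enumerate(pmf))


def binomial_pmf(n, p):
    """Binomial(n, p) probabilities on 0, 1, ..., n."""
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


# Binomial(5, 0.5) is log-concave relative to the uniform pmf on {0, ..., 5},
# and both have mean 2.5, so the binomial should also be smaller in convex order.
half = binomial_pmf(5, 0.5)
uniform = [1 / 6] * 6
lc_holds = is_lc_smaller(half, uniform)
cx_holds = all(stop_loss(half, b) <= stop_loss(uniform, b) + 1e-12 for b in range(6))
```

Comparing stop-loss transforms on the integer thresholds suffices here because both transforms are piecewise linear with kinks only at the support points.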

3 Stay-on-a-winner 

This section derives a basic monotonicity property of the optimal strategy: as the observation 
from an arm becomes larger, the inclination to pull that arm again also increases. Under 
suitable conditions we prove a generalized stay-on-a-winner rule, which is a natural extension 
of the results for Bernoulli bandits (Bradt, Johnson and Karlin 1956; Berry 1972; Berry and 
Fristedt 1985). 

Let us define the advantage of arm 1 over arm 2 as

$$\Delta(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n) = V^1(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n) - V^2(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n).$$

Define $\Delta^+ = \max\{\Delta, 0\}$ and $\Delta^- = \min\{\Delta, 0\}$. By considering the initial two pulls one can
show (Berry 1972)

$$\Delta(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n) = (a_1 - a_2)\left(\frac{\gamma_1}{\tau_1} - \frac{\gamma_2}{\tau_2}\right) \tag{7}$$
$$\qquad + E\left[\Delta^+(\gamma_1 + X, \tau_1 + 1; \gamma_2, \tau_2; A_n^1) \mid \gamma_1, \tau_1\right] \tag{8}$$
$$\qquad + E\left[\Delta^-(\gamma_1, \tau_1; \gamma_2 + Y, \tau_2 + 1; A_n^1) \mid \gamma_2, \tau_2\right]. \tag{9}$$

Proposition 1 states that as the prior mean of arm 1 increases, so does the advantage of
arm 1 over arm 2, assuming $A_n$ is decreasing. This can be extended to non-conjugate priors.
Specifically, $\Delta$ increases as the prior for arm 1 becomes larger in the likelihood ratio order.
Extensions to general Markov decision problems are also possible (Rieder and Wagner 1991).
We provide a complete proof which serves as an introduction to the derivation of the main results
in Section 4.

Proposition 1. Suppose $A_n$ is decreasing. Then $\Delta(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n)$ increases in $\gamma_1$.

Proof. The $n = 1$ case is easy. Let us use induction for $n \ge 2$. In view of (7)-(9), we only need
to show that

$$E\left[\Delta^+(\gamma_1 + X, \tau_1 + 1; \gamma_2, \tau_2; A_n^1) \mid \gamma_1, \tau_1\right] \quad \text{and} \tag{10}$$
$$E\left[\Delta^-(\gamma_1, \tau_1; \gamma_2 + Y, \tau_2 + 1; A_n^1) \mid \gamma_2, \tau_2\right] \tag{11}$$

both increase in $\gamma_1$. Monotonicity of (11) follows from the induction hypothesis. To handle (10),
let us consider $\gamma_1 < \tilde\gamma_1$. Let $\theta_1$ and $\tilde\theta_1$ have the $(\gamma_1, \tau_1)$ and $(\tilde\gamma_1, \tau_1)$ priors respectively. Let
$g(x)$ (respectively, $\tilde g(x)$) be the marginal density of $X$ if it is drawn according to (1) given $\theta_1$
(respectively, $\tilde\theta_1$). Note that $\theta_1 \le_{\mathrm{lr}} \tilde\theta_1$. In view of (6), we know that $g \le_{\mathrm{lr}} \tilde g$ by total positivity
considerations (Karlin 1968, Chapter 3). It follows that $g \le_{\mathrm{st}} \tilde g$. By the induction hypothesis,

$$\phi(x) = \Delta^+(x, \tau_1 + 1; \gamma_2, \tau_2; A_n^1)$$

increases in $x$. Thus

$$E\left[\phi(\gamma_1 + X) \mid \gamma_1, \tau_1\right] \le E\left[\phi(\tilde\gamma_1 + X) \mid \gamma_1, \tau_1\right]$$
$$\le E\left[\phi(\tilde\gamma_1 + X) \mid \tilde\gamma_1, \tau_1\right], \tag{12}$$

where (12) holds because $g \le_{\mathrm{st}} \tilde g$. Hence (10) increases in $\gamma_1$. □

Corollary 1. Suppose $A_n$ is a decreasing sequence, and an observation $x$ is taken from arm 1
initially. Then, at the second stage, either arm 1 is optimal for all $x$, or arm 2 is optimal for
all $x$, or there exists some $x^* \in \mathcal{K}$ such that arm 1 is optimal if $x \ge x^*$ and arm 2 is optimal if
$x \le x^*$.

Proof. We can show that $\Delta(\gamma_1 + x, \tau_1 + 1; \gamma_2, \tau_2; A_n^1)$ is continuous in $x$. (One method is to use the
convexity result of Proposition 2 in Section 4.) The claim then follows from Proposition 1. □

The next result, Theorem 1, is a generalized stay-on-a-winner rule: under suitable conditions
if an arm is optimal initially then it continues to be optimal at the next stage provided that the
initial observation from that arm is large enough.

Theorem 1. Assume $A_n$ is decreasing, $n \ge 2$, and either (i) $a_1 = a_2$ or (ii) $\gamma_1/\tau_1 \le \gamma_2/\tau_2$
holds. Assume $\Delta(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n) \ge 0$, i.e., arm 1 is optimal initially. Then
$\Delta(\gamma_1 + x, \tau_1 + 1; \gamma_2, \tau_2; A_n^1) \ge 0$ for sufficiently large $x \in \mathcal{K}$.

Proof. We may assume $a_i > 0$ for all $i \le n$. Let $U$ be the upper end point of $\mathcal{K}$. If $U = \infty$, then
using (7)-(9), it is easy to show by induction that $\Delta(\gamma_1 + x, \tau_1 + 1; \gamma_2, \tau_2; A_n^1) \ge 0$ for sufficiently
large $x$. That is, the claim holds even without assuming that arm 1 is optimal initially. Assume
$U < \infty$ and $\Delta(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n) \ge 0$. By (7)-(9) we have

$$0 \le E\left[\Delta^+(\gamma_1 + X, \tau_1 + 1; \gamma_2, \tau_2; A_n^1) \mid \gamma_1, \tau_1\right] + E\left[\Delta^-(\gamma_1, \tau_1; \gamma_2 + Y, \tau_2 + 1; A_n^1) \mid \gamma_2, \tau_2\right]. \tag{13}$$

Suppose the claim does not hold, i.e., $\Delta(\gamma_1 + x, \tau_1 + 1; \gamma_2, \tau_2; A_n^1) < 0$ for all $x \in \mathcal{K}$. In particular,

$$\Delta(\gamma_1 + U, \tau_1 + 1; \gamma_2, \tau_2; A_n^1) \le 0. \tag{14}$$

Then it is necessary that both expectations in (13) are zero. That is,

$$\Delta(\gamma_1, \tau_1; \gamma_2 + y, \tau_2 + 1; A_n^1) \ge 0 \quad \text{for all } y \in \mathcal{K}.$$

By continuity, $\Delta(\gamma_1, \tau_1; \gamma_2 + U, \tau_2 + 1; A_n^1) \ge 0$. However, the $(\gamma_1 + U, \tau_1 + 1)$ prior is larger than
the $(\gamma_1, \tau_1)$ prior in the likelihood ratio order. The argument of Proposition 1 yields

$$\Delta(\gamma_1 + U, \tau_1 + 1; \gamma_2, \tau_2; A_n^1) > \Delta(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n^1)$$
$$\ge \Delta(\gamma_1, \tau_1; \gamma_2 + U, \tau_2 + 1; A_n^1) \ge 0,$$

which contradicts (14). □

4 Monotonicity 

Proposition 2 shows that the maximum expected payoff is an increasing and convex function
of the prior mean of any arm. The convexity will be useful in proving Theorem 2 concerning
monotonicity with respect to the prior weight.

Proposition 2. $V(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n)$ is increasing and convex in each of $\gamma_i$, $i = 1, 2$.

Proof. Monotonicity holds by the same argument that proves Proposition 1. Let us focus on
the convexity with respect to $\gamma_1$. The $n = 1$ case is easy. For $n \ge 2$ we use induction. Note that
by (3)-(5) it suffices to show that both

$$E\left[V(\gamma_1 + X, \tau_1 + 1; \gamma_2, \tau_2; A_n^1) \mid \gamma_1, \tau_1\right] \quad \text{and} \tag{15}$$
$$E\left[V(\gamma_1, \tau_1; \gamma_2 + Y, \tau_2 + 1; A_n^1) \mid \gamma_2, \tau_2\right] \tag{16}$$

are convex in $\gamma_1$. The claim for (16) follows from the induction hypothesis. To deal with (15),
suppose $\gamma_1 < \tilde\gamma_1$. Denote the marginal of $X$ when the prior on $\theta$ is $(\gamma_1, \tau_1)$ (respectively, $(\tilde\gamma_1, \tau_1)$)
by $g$ (respectively, $\tilde g$). Then $g \le_{\mathrm{st}} \tilde g$ as in the proof of Proposition 1. By the induction hypothesis,

$$\phi(x) = V(x, \tau_1 + 1; \gamma_2, \tau_2; A_n^1)$$

is convex in $x$. Moreover,

$$E\left[\phi(\gamma_1 + X) \mid \gamma_1, \tau_1\right] - E\left[\phi\left(\frac{\gamma_1 + \tilde\gamma_1}{2} + X\right) \,\Big|\, \gamma_1, \tau_1\right]$$
$$\ge E\left[\eta(X) \mid \gamma_1, \tau_1\right] \tag{17}$$
$$\ge E\left[\eta(X) \mid \tilde\gamma_1, \tau_1\right], \tag{18}$$

where

$$\eta(x) = \phi\left(\frac{\gamma_1 + \tilde\gamma_1}{2} + x\right) - \phi(\tilde\gamma_1 + x).$$

The inequality (17) holds because $\phi$ is convex; (18) holds because $\eta$ is decreasing and $g \le_{\mathrm{st}} \tilde g$.
Rearranging we get

$$E\left[\phi(\gamma_1 + X) \mid \gamma_1, \tau_1\right] + E\left[\phi(\tilde\gamma_1 + X) \mid \tilde\gamma_1, \tau_1\right] \ge 2 E\phi\left(\frac{\gamma_1 + \tilde\gamma_1}{2} + X^*\right),$$

where $X^*$ has the following distribution. Given $\theta$, $X^*$ is distributed according to (1); the prior
on $\theta$ is a half-half mixture of $(\gamma_1, \tau_1)$ and $(\tilde\gamma_1, \tau_1)$. Denote this mixture density by $h^*(\theta)$, and the
$((\gamma_1 + \tilde\gamma_1)/2, \tau_1)$ prior density by $h(\theta)$. Then $h(\theta) \le_{\mathrm{lc}} h^*(\theta)$, because log-convexity is closed under
mixtures (Marshall and Olkin 1979). Consider the difference between the marginal densities

$$D(x) = \int_{\Theta} e^{\theta x - \psi(\theta)} \left[h(\theta) - h^*(\theta)\right] d\theta.$$

Relative log-concavity implies that, as $\theta$ traverses $\Theta$, $h(\theta) - h^*(\theta)$ changes signs at most twice
and, in the case of two changes, the sign sequence is $-, +, -$. By the variation-diminishing
properties of the Laplace transform (Karlin 1968, Chapter 5), $D(x)$ has at most two changes of
sign, and in the case of two changes, the sign sequence is $-, +, -$. Note that, when the prior
is either $h$ or $h^*$, the marginal mean of $X$ is the same, namely $(\gamma_1 + \tilde\gamma_1)/(2\tau_1)$. Hence it is not
possible for $D(x)$ to change signs exactly once. Unless $D(x) \equiv 0$, its sign sequence must be
$-, +, -$. It follows that the marginal distribution of $X$ becomes larger in the convex order when
$h^*(\theta)$ replaces $h(\theta)$ as the prior for $\theta$ (see, e.g., Yu 2010, Lemma 1). Using the convexity of $\phi$
again, we obtain

$$E\phi\left(\frac{\gamma_1 + \tilde\gamma_1}{2} + X^*\right) \ge E\left[\phi\left(\frac{\gamma_1 + \tilde\gamma_1}{2} + X\right) \,\Big|\, \frac{\gamma_1 + \tilde\gamma_1}{2}, \tau_1\right].$$

It follows that $E[\phi(\gamma_1 + X) \mid \gamma_1, \tau_1]$, i.e., (15), is convex in $\gamma_1$, as required. □

Our main result, Theorem 2, shows that the value of the bandit decreases as the prior weight
of an arm increases. That is, given the same immediate payoff, an arm becomes less desirable
as the amount of information about it increases.

Theorem 2. $V(c\gamma_1, c\tau_1; \gamma_2, \tau_2; A_n)$ decreases in $c \in (0, \infty)$.

Proof. Let us use induction on $n$. The $n = 1$ case is easy. Suppose $n \ge 2$. In view of (3)-(5),
we only need to show that

$$E\left[V(c\gamma_1 + X, c\tau_1 + 1; \gamma_2, \tau_2; A_n^1) \mid c\gamma_1, c\tau_1\right] \quad \text{and} \tag{19}$$
$$E\left[V(c\gamma_1, c\tau_1; \gamma_2 + Y, \tau_2 + 1; A_n^1) \mid \gamma_2, \tau_2\right] \tag{20}$$

both decrease in $c$. By the induction hypothesis, (20) decreases in $c$. To deal with (19), suppose
$0 < c < \tilde c$ and denote $\xi = (c\tau_1 + 1)/(\tilde c\tau_1 + 1)$. We get

$$E\left[V(c\gamma_1 + X, c\tau_1 + 1; \gamma_2, \tau_2; A_n^1) \mid c\gamma_1, c\tau_1\right]$$
$$\ge E\left[V(\xi(\tilde c\gamma_1 + X), c\tau_1 + 1; \gamma_2, \tau_2; A_n^1) \mid c\gamma_1, c\tau_1\right] \tag{21}$$
$$\ge E\left[V(\tilde c\gamma_1 + X, \tilde c\tau_1 + 1; \gamma_2, \tau_2; A_n^1) \mid c\gamma_1, c\tau_1\right] \tag{22}$$
$$\ge E\left[V(\tilde c\gamma_1 + X, \tilde c\tau_1 + 1; \gamma_2, \tau_2; A_n^1) \mid \tilde c\gamma_1, \tilde c\tau_1\right]. \tag{23}$$

The inequality (21) holds by the convexity of $V$ as shown by Proposition 2, noting

$$\xi(\tilde c\gamma_1 + X) \le_{\mathrm{cx}} c\gamma_1 + X$$

(see Lemma 3 in Section 7, or Shaked and Shanthikumar 2007, Theorem 3.A.18). The inequality
(22) holds by the induction hypothesis, as $\xi < 1$ and $c\tau_1 + 1 = \xi(\tilde c\tau_1 + 1)$. The inequality (23) holds by an argument
similar to the proof of Proposition 2. Specifically, the prior $(\tilde c\gamma_1, \tilde c\tau_1)$ is log-concave relative to
$(c\gamma_1, c\tau_1)$. Thus the marginal of $X$ increases in the convex order if $(c\gamma_1, c\tau_1)$ replaces $(\tilde c\gamma_1, \tilde c\tau_1)$
as the prior on $\theta$ (the mean of $X$ remains constant). Overall (19) decreases in $c$, as required. □
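
A small worked case may help fix ideas. The computation below is an illustration under added assumptions (Bernoulli arms, horizon $n = 2$, discount sequence $(1, 1)$, and a common prior mean $\mu$ for both arms); it is not part of the proof.

```latex
% Setting: \mu = \gamma_1/\tau_1 = \gamma_2/\tau_2, A_2 = (1, 1), arm 1 has the (c\gamma_1, c\tau_1) prior.
% After one pull of arm 1 its mean becomes
\frac{c\gamma_1 + 1}{c\tau_1 + 1} = \mu + \frac{1-\mu}{c\tau_1 + 1} > \mu
\quad\text{(success)}, \qquad
\frac{c\gamma_1}{c\tau_1 + 1} < \mu \quad\text{(failure)}.
% At the last stage the better current mean is chosen, so by (4)
V^1 = \mu + \mu\left(\mu + \frac{1-\mu}{c\tau_1 + 1}\right) + (1-\mu)\,\mu
    = 2\mu + \frac{\mu(1-\mu)}{c\tau_1 + 1},
% and by symmetry, from (5),
V^2 = 2\mu + \frac{\mu(1-\mu)}{\tau_2 + 1}.
% Hence, by (3),
V(c\gamma_1, c\tau_1; \gamma_2, \tau_2; A_2)
    = 2\mu + \frac{\mu(1-\mu)}{\min\{c\tau_1, \tau_2\} + 1},
% which is non-increasing in c, in agreement with Theorem 2.
```

In this special case the arm with the smaller prior weight is the better first pull, since the term $\mu(1-\mu)/(\tau + 1)$ measures how much one observation can move the posterior mean.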

Remark. Proposition 2 and Theorem 2 extend naturally to bandits with more than two
arms. We present the two-armed version for simplicity. The discount sequence $A_n$ is only
required to be nonnegative. By approximation, this can be further extended to the infinite-horizon
case assuming $\sum_{i=1}^{\infty} a_i < \infty$.

5 The one-armed case 

This section considers the one-armed case assuming that arm 2 yields a constant payoff $\lambda$ at
each pull. We shall abuse the notation by calling this a $(\gamma, \tau; \lambda; A_n)$ bandit, where we drop the
subscripts on $\gamma_1$ and $\tau_1$ for convenience. Results in Section 4 are applied to derive monotonicity
properties of the break-even value in this case. It is also shown (Proposition 3) that if both
arms are optimal initially, then an observation from arm 1 that is less than its prior mean would
make arm 2 optimal thereafter.

A discount sequence $A_n = (a_1, a_2, \ldots)$ is called regular if, letting $b_j = \sum_{i \ge j} a_i$, we have
$b_{j+1}^2 \ge b_j b_{j+2}$ for all $j \ge 1$ (Berry and Fristedt 1979). For regular discount sequences, our
one-armed bandit is an optimal stopping problem, i.e., if at any stage the known arm becomes
optimal then it remains optimal in all subsequent stages. Moreover, if $A_n$ is regular and $a_1 > 0$,
then there exists a break-even value $\Lambda(\gamma, \tau; A_n)$ for the $(\gamma, \tau; \lambda; A_n)$ bandit, such that arm 1
is optimal initially if and only if $\lambda \le \Lambda(\gamma, \tau; A_n)$ and arm 2 is optimal initially if and only if
$\lambda \ge \Lambda(\gamma, \tau; A_n)$. For infinite-horizon geometric discounting, this break-even value is also known
as the dynamic allocation index or Gittins index (Gittins and Jones 1974). The following result
holds by the optimal stopping characterization.

Lemma 1. If $A_n$ is regular and $a_1 > 0$, then $\Lambda(\gamma, \tau; A_n)$ is the smallest $\lambda$ such that

$$V(\gamma, \tau; \lambda; A_n) \le \lambda \sum_{i=1}^n a_i.$$
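
The regularity condition assumed in Lemma 1 is straightforward to verify for a given finite discount sequence. The following is a small sketch (the function names are hypothetical) that forms the tail sums $b_j$, treats $b_j$ as zero beyond the horizon, and checks the defining inequality for every $j$.

```python
from fractions import Fraction


def tail_sums(discounts):
    """Return [b_1, ..., b_n, 0, 0], where b_j = a_j + a_{j+1} + ... + a_n."""
    sums, running = [], 0
    for a in reversed(list(discounts)):
        running += a
        sums.append(running)
    sums.reverse()
    # Beyond the horizon the tail sums vanish; pad so b_{j+2} is always defined.
    return sums + [0, 0]


def is_regular(discounts):
    """Check b_{j+1}^2 >= b_j * b_{j+2} for all j = 1, ..., n."""
    b = tail_sums(discounts)
    return all(b[j + 1] ** 2 >= b[j] * b[j + 2] for j in range(len(b) - 2))


uniform_ok = is_regular([1, 1, 1, 1])
geometric_ok = is_regular([Fraction(1, 2) ** k for k in range(4)])
gap_ok = is_regular([1, 0, 1])  # a zero followed by a positive discount
```

Passing integers or `Fraction` values keeps the comparison exact; with floats, borderline cases could be affected by rounding. Here the uniform and truncated geometric sequences are regular, while `[1, 0, 1]` is not.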

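Corollary 2 summarizes some monotonicity properties of $\Lambda(\gamma, \tau; A_n)$. It extends to infinite-horizon
regular discounting. As special cases we recover the results of Gittins and Wang (1992)
on Bernoulli and normal bandits with geometric discounting; see also Yao (2006).

Corollary 2. If $A_n$ is regular and $a_1 > 0$, then $\Lambda(c\gamma, c\tau; A_n)$ decreases in $c > 0$ and strictly
increases in $\gamma$.

Proof. Monotonicity in $c$ follows from Theorem 2 and Lemma 1. Monotonicity in $\gamma$ follows from
Proposition 2 and Lemma 1. To show strict monotonicity, let us set $c = 1$ and assume that $\gamma, \tilde\gamma$
satisfy $\gamma < \tilde\gamma$ and

$$\Lambda(\gamma, \tau; A_n) = \Lambda(\tilde\gamma, \tau; A_n) = \lambda_*.$$

Then, as in the proof of Proposition 1, we get

$$\lambda_* \sum_{i=1}^n a_i = a_1 \frac{\gamma}{\tau} + E\left[V(\gamma + X, \tau + 1; \lambda_*; A_n^1) \mid \gamma, \tau\right]$$
$$< a_1 \frac{\tilde\gamma}{\tau} + E\left[V(\gamma + X, \tau + 1; \lambda_*; A_n^1) \mid \gamma, \tau\right]$$
$$\le a_1 \frac{\tilde\gamma}{\tau} + E\left[V(\tilde\gamma + X, \tau + 1; \lambda_*; A_n^1) \mid \tilde\gamma, \tau\right]$$
$$= \lambda_* \sum_{i=1}^n a_i,$$

which is a contradiction. □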
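
Lemma 1 suggests a direct numerical route to the break-even value: evaluate the one-armed bandit by dynamic programming and bisect on the known payoff. The sketch below does this for a Bernoulli unknown arm with a $(\gamma, \tau)$ prior, assuming $0 < \gamma < \tau$ and a regular discount sequence with $a_1 > 0$; the function names, the tolerance, and the bracket $[0, 1]$ are illustrative choices.

```python
from functools import lru_cache


def one_armed_value(g, t, lam, discounts):
    """Value of the (g, t; lam; A_n) bandit: Bernoulli unknown arm, known payoff lam."""
    a = tuple(discounts)
    n = len(a)

    @lru_cache(maxsize=None)
    def value(s, m, stage):
        # s successes in m pulls of the unknown arm; `stage` pulls made in total.
        if stage == n:
            return 0.0
        mu = (g + s) / (t + m)
        pull_unknown = (a[stage] * mu
                        + mu * value(s + 1, m + 1, stage + 1)
                        + (1 - mu) * value(s, m + 1, stage + 1))
        pull_known = a[stage] * lam + value(s, m, stage + 1)
        return max(pull_unknown, pull_known)

    return value(0, 0, 0)


def break_even_value(g, t, discounts, iters=60):
    """Smallest lam with V(g, t; lam; A_n) <= lam * sum(a_i), found by bisection."""
    total = sum(discounts)
    lo, hi = 0.0, 1.0  # Bernoulli payoffs lie in [0, 1], so lam = 1 always qualifies.
    for _ in range(iters):
        mid = (lo + hi) / 2
        if one_armed_value(g, t, mid, discounts) > mid * total + 1e-12:
            lo = mid  # the unknown arm is still strictly worth pulling
        else:
            hi = mid
    return hi
```

For example, `break_even_value(1, 2, [1.0, 1.0])` returns approximately 5/9, which exceeds the prior mean 1/2. Evaluating `break_even_value(c, 2 * c, [1.0] * 5)` for increasing `c` gives a quick numerical illustration of the first claim of Corollary 2.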
For a regular and positive discount sequence $A_n$, Proposition 3 shows that there exists a
break-even observation $b(\gamma, \tau; A_n)$ for the $(\gamma, \tau; \lambda; A_n)$ bandit such that if both arms are optimal
initially, and an observation $x$ is taken from arm 1, then arm 1 remains optimal if $x \ge b(\gamma, \tau; A_n)$
and arm 2 becomes optimal if $x \le b(\gamma, \tau; A_n)$. Moreover, this break-even observation is no
smaller than $\gamma/\tau$, the prior mean.

Proposition 3. Suppose $A_n$ is regular, $n \ge 2$, and $a_1, a_2 > 0$. Then there exists a unique
$b(\gamma, \tau; A_n) \in \mathcal{K}$ such that $b(\gamma, \tau; A_n) \ge \gamma/\tau$ and

$$\Lambda(\gamma, \tau; A_n) \ge \Lambda(\gamma + x, \tau + 1; A_n^1), \quad \text{if } x \le b(\gamma, \tau; A_n); \tag{24}$$
$$\Lambda(\gamma, \tau; A_n) \le \Lambda(\gamma + x, \tau + 1; A_n^1), \quad \text{if } x \ge b(\gamma, \tau; A_n). \tag{25}$$

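To prove Proposition 3 we need a continuity lemma. Its proof, taken from Clayton and Berry
(1985), is included for completeness.

Lemma 2. Suppose $A_n$ is regular and $a_1 > 0$. Then $\Lambda(\gamma, \tau; A_n)$ is continuous in $\gamma$.

Proof. Fix $\gamma_0$ and note that $\lambda = \Lambda(\gamma, \tau; A_n)$ is the unique root of

$$V^1(\gamma, \tau; \lambda; A_n) - V^2(\gamma, \tau; \lambda; A_n) = 0.$$

By continuity of $V^1$ and $V^2$, we have

$$0 = \lim_{\gamma \uparrow \gamma_0} \left[V^1(\gamma, \tau; \Lambda(\gamma, \tau; A_n); A_n) - V^2(\gamma, \tau; \Lambda(\gamma, \tau; A_n); A_n)\right]$$
$$= V^1\left(\gamma_0, \tau; \lim_{\gamma \uparrow \gamma_0} \Lambda(\gamma, \tau; A_n); A_n\right) - V^2\left(\gamma_0, \tau; \lim_{\gamma \uparrow \gamma_0} \Lambda(\gamma, \tau; A_n); A_n\right).$$

By uniqueness of $\lambda$, we have $\lim_{\gamma \uparrow \gamma_0} \Lambda(\gamma, \tau; A_n) = \Lambda(\gamma_0, \tau; A_n)$. Similarly, the limit holds when
$\gamma \downarrow \gamma_0$. □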

Proof of Proposition 3. Let $U$ be the upper end point of $\mathcal{K}$. If $U = \infty$ then $\Lambda(\gamma + x, \tau + 1; A_n^1) \to \infty$
as $x \to \infty$ (the expected payoff by always selecting arm 1 becomes arbitrarily large). If $U < \infty$
then we can show $\Lambda(\gamma + U, \tau + 1; A_n^1) > \Lambda(\gamma, \tau; A_n)$ as follows. Assume the contrary and consider
the $(\gamma, \tau; \lambda_*; A_n)$ bandit with $\lambda_* = \Lambda(\gamma + U, \tau + 1; A_n^1)$. We have

$$\lambda_* \sum_{i=1}^n a_i \le a_1 \frac{\gamma}{\tau} + E\left[V(\gamma + X, \tau + 1; \lambda_*; A_n^1) \mid \gamma, \tau\right].$$

Since $\gamma/\tau \in \mathcal{K}$, and $\mathcal{K}$ is open, we have $\lambda_* \ge (\gamma + U)/(\tau + 1) > \gamma/\tau$. Thus

$$\lambda_* \sum_{i=2}^n a_i < E\left[V(\gamma + X, \tau + 1; \lambda_*; A_n^1) \mid \gamma, \tau\right]$$
$$\le V(\gamma + U, \tau + 1; \lambda_*; A_n^1)$$
$$= \lambda_* \sum_{i=2}^n a_i,$$

which is a contradiction. We also have

$$\Lambda(\gamma, \tau; A_n) \ge \Lambda(\gamma, \tau; A_n^1)$$
$$\ge \Lambda(\gamma + \gamma/\tau, \tau + 1; A_n^1),$$

where the first inequality holds by the optimal stopping characterization, and the second by
Corollary 2.

By Lemma 2 and Corollary 2, $\Lambda(\gamma + x, \tau + 1; A_n^1)$ is continuous and strictly increasing in $x$.
By the intermediate value theorem, there exists a unique $b(\gamma, \tau; A_n) \in [\gamma/\tau, U)$ such that (24) and (25)
hold. □

It is tempting to conjecture that $b(\gamma, \tau; A_n) \ge \Lambda(\gamma, \tau; A_n)$, which gives a tighter bound since
$\Lambda(\gamma, \tau; A_n) \ge \gamma/\tau$. However, our methods are not yet strong enough to resolve this conjecture.
Clayton and Berry (1985) conjectured and Yu (2011) proved an analogous bound for Dirichlet
bandits.



6 Bernoulli bandits with general priors 

As noted earlier, results based on likelihood ratio orders, such as those in Section 3, may extend
to nonconjugate priors. This section shows that Theorem 2 can also be extended this way, at
least in the Bernoulli case.

Given $p_i$, $i = 1, 2$, let us assume that observations from arm $i$ are i.i.d. Bernoulli($p_i$). Priors
on $p_i$ are independent with densities $f_i$ with respect to a $\sigma$-finite measure $G$ on $[0, 1]$. We shall
denote the value of this Bernoulli bandit with discount sequence $A_n$ by $V_B(f_1; f_2; A_n)$. Let $\mu(f)$
denote the mean of any prior $f$, i.e., $\mu(f) = \int_{[0,1]} p f(p)\, dG(p)$.

Theorem 3. If $f_1 \le_{\mathrm{lc}} \tilde f_1$ and $\mu(f_1) = \mu(\tilde f_1)$, then $V_B(f_1; f_2; A_n) \le V_B(\tilde f_1; f_2; A_n)$.

Note that the Beta($c\alpha$, $c\beta$) prior ($c, \alpha, \beta > 0$) decreases in the relative log-concavity order as
$c$ increases. Theorem 3 therefore recovers the Bernoulli case of Theorem 2 for conjugate priors.
Let $\Lambda_B(f; A_n)$ denote the break-even value of a one-armed Bernoulli bandit whose unknown
arm has prior $f$. We obtain Corollary 3 as a consequence of Theorem 3 and Lemma 1.

Corollary 3. Assume $A_n$ is regular and $a_1 > 0$. If $f \le_{\mathrm{lc}} \tilde f$ and $\mu(f) = \mu(\tilde f)$, then
$\Lambda_B(f; A_n) \le \Lambda_B(\tilde f; A_n)$.

Herschkorn (1997) posed the problem of identifying a variability ordering between priors
so that both $V_B$ and $\Lambda_B$ are monotonic with respect to it. Theorem 3 and Corollary 3 show
that there is indeed such an ordering, namely the relative log-concavity order (assuming equal
means). A conjecture of Herschkorn (1997) states that Corollary 3 holds under the weaker
assumption $f \le_{\mathrm{cx}} \tilde f$. This conjecture remains open.

Proof of Theorem 3. The $n = 1$ case is easy. For $n \ge 2$ we use induction. The equations (3)-(5)
become

$$V_B(f_1; f_2; A_n) = \max\left\{V_B^1(f_1; f_2; A_n),\; V_B^2(f_1; f_2; A_n)\right\};$$
$$V_B^1(f_1; f_2; A_n) = \mu(f_1)\left(a_1 + V_B(\sigma f_1; f_2; A_n^1)\right) + (1 - \mu(f_1)) V_B(\phi f_1; f_2; A_n^1); \tag{26}$$
$$V_B^2(f_1; f_2; A_n) = \mu(f_2)\left(a_1 + V_B(f_1; \sigma f_2; A_n^1)\right) + (1 - \mu(f_2)) V_B(f_1; \phi f_2; A_n^1).$$

We use $\sigma f$ (respectively, $\phi f$) to denote the posterior density after observing one success (respectively,
one failure). That is,

$$(\sigma f)(p) = \frac{f(p)\, p}{\mu(f)}; \qquad (\phi f)(p) = \frac{f(p)(1 - p)}{1 - \mu(f)}.$$

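Let us assume $f_1$ is nondegenerate. Because $f_1 \le_{\mathrm{lc}} \tilde f_1$ and $\mu(f_1) = \mu(\tilde f_1)$ we have $f_1 \le_{\mathrm{cx}} \tilde f_1$
(see, e.g., Yu 2010, Theorem 12). Thus

$$\mu(f_1)\mu(\sigma f_1) = \int_{[0,1]} p^2 f_1(p)\, dG(p) \le \int_{[0,1]} p^2 \tilde f_1(p)\, dG(p) = \mu(\sigma \tilde f_1)\mu(\tilde f_1),$$

yielding $\mu(\sigma f_1) \le \mu(\sigma \tilde f_1)$. Similarly, $\mu(\phi f_1) \ge \mu(\phi \tilde f_1)$. Define

$$\epsilon^* = \frac{\mu(\sigma \tilde f_1) - \mu(\sigma f_1)}{\mu(\sigma \tilde f_1) - \mu(\phi \tilde f_1)}; \qquad \epsilon_* = \frac{\mu(\phi f_1) - \mu(\phi \tilde f_1)}{\mu(\sigma \tilde f_1) - \mu(\phi \tilde f_1)}.$$

Then $\epsilon^*, \epsilon_* \in [0, 1)$. Define

$$g^* = (1 - \epsilon^*)\sigma \tilde f_1 + \epsilon^* \phi \tilde f_1; \qquad g_* = \epsilon_* \sigma \tilde f_1 + (1 - \epsilon_*)\phi \tilde f_1.$$

Convexity of $V_B$ with respect to mixtures gives

$$V_B(g^*; f_2; A_n^1) \le (1 - \epsilon^*) V_B(\sigma \tilde f_1; f_2; A_n^1) + \epsilon^* V_B(\phi \tilde f_1; f_2; A_n^1);$$
$$V_B(g_*; f_2; A_n^1) \le \epsilon_* V_B(\sigma \tilde f_1; f_2; A_n^1) + (1 - \epsilon_*) V_B(\phi \tilde f_1; f_2; A_n^1).$$

Noting $\mu(f_1)\epsilon^* = (1 - \mu(f_1))\epsilon_*$, we add $\mu(f_1)$ times the first inequality to $1 - \mu(f_1)$ times the
second and get

$$\mu(f_1) V_B(g^*; f_2; A_n^1) + (1 - \mu(f_1)) V_B(g_*; f_2; A_n^1)$$
$$\le \mu(\tilde f_1) V_B(\sigma \tilde f_1; f_2; A_n^1) + (1 - \mu(\tilde f_1)) V_B(\phi \tilde f_1; f_2; A_n^1). \tag{27}$$

The density $g^*$ is simply

$$g^*(p) = \left[\frac{p(1 - \epsilon^*)}{\mu(\tilde f_1)} + \frac{(1 - p)\epsilon^*}{1 - \mu(\tilde f_1)}\right] \tilde f_1(p).$$

It is easy to check $(1 - \epsilon^*)/\mu(\tilde f_1) \ge \epsilon^*/(1 - \mu(\tilde f_1))$, which leads to

$$\sigma f_1 \le_{\mathrm{lc}} \sigma \tilde f_1 \le_{\mathrm{lc}} g^*.$$

Moreover, $\sigma f_1$ and $g^*$ have the same mean. By the induction hypothesis, we have

$$V_B(\sigma f_1; f_2; A_n^1) \le V_B(g^*; f_2; A_n^1). \tag{28}$$

Similarly,

$$V_B(\phi f_1; f_2; A_n^1) \le V_B(g_*; f_2; A_n^1). \tag{29}$$

We combine (27)-(29) to get

$$\mu(f_1) V_B(\sigma f_1; f_2; A_n^1) + (1 - \mu(f_1)) V_B(\phi f_1; f_2; A_n^1)$$
$$\le \mu(\tilde f_1) V_B(\sigma \tilde f_1; f_2; A_n^1) + (1 - \mu(\tilde f_1)) V_B(\phi \tilde f_1; f_2; A_n^1).$$

Applying (26) then yields

$$V_B^1(f_1; f_2; A_n) \le V_B^1(\tilde f_1; f_2; A_n).$$

The rest of the proof is standard. □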

Remark. Theorem 3 focuses on the parameter $p$. If we still require equal prior means for
$p$, but impose the log-concavity order on $\theta = \log(p/(1 - p))$ rather than $p$, then $V_B$ is ordered
by virtually the same proof. This result is distinct from Theorem 3 because the relative log-concavity
order is usually not preserved by monotone transformations.

7 Normal bandits with general priors

The main result of this section (Theorem 4) extends Theorem 2 to general priors for normal
bandits. Similar to Theorem 3, Theorem 4 is based on the relative log-concavity order, although
it is more restrictive because we only compare a general prior with a normal prior.

Given $\theta_i$, $i = 1, 2$, let us assume that observations from arm $i$ are i.i.d. N($\theta_i$, 1). Priors on
$\theta_i$ are independent with Lebesgue densities $f_i$. We shall denote the value of this normal bandit
with discount sequence $A_n$ by $V_N(f_1; f_2; A_n)$. Denote the mean of any $f$ by $\mu(f) = \int \theta f(\theta)\, d\theta$.

Theorem 4. Let $\tilde f_1 = \mathrm{N}(\alpha, 1/\tau)$.

1. If $f_1 \le_{\mathrm{lc}} \tilde f_1$ and $\mu(f_1) = \alpha$, then $V_N(f_1; f_2; A_n) \le V_N(\tilde f_1; f_2; A_n)$.

2. If $\tilde f_1 \le_{\mathrm{lc}} f_1$ and $\mu(f_1) = \alpha$, then $V_N(\tilde f_1; f_2; A_n) \le V_N(f_1; f_2; A_n)$.

Let $\Lambda_N(f; A_n)$ denote the break-even value of a one-armed normal bandit with prior $f$ for
the mean of the unknown arm. We obtain Corollary 4 as a consequence of Theorem 4 and
Lemma 1.

Corollary 4. Assume $A_n$ is regular and $a_1 > 0$. Define $\tilde f = \mathrm{N}(\alpha, 1/\tau)$.
1. If $f \le_{\mathrm{lc}} \tilde f$ and $\mu(f) = \alpha$, then $\Lambda_N(f; A_n) \le \Lambda_N(\tilde f; A_n)$.
2. If $\tilde f \le_{\mathrm{lc}} f$ and $\mu(f) = \alpha$, then $\Lambda_N(\tilde f; A_n) \le \Lambda_N(f; A_n)$.

The condition $f \le_{\mathrm{lc}} \mathrm{N}(\alpha, 1/\tau)$ is essentially $d^2 \log f(\theta)/d\theta^2 \le -\tau$, which can be regarded as
a strong form of information ordering. The appearance of $\le_{\mathrm{lc}}$ is therefore especially intuitive
in Theorem 4 and Corollary 4. It is an open problem whether Theorem 4 and Corollary 4 hold
without assuming that one of the priors is normal.

The rest of this section proves Theorem 4. We need a technical result (Lemma 3) which may
be of independent interest.

Lemma 3. Let $g$ be a differentiable function on $\mathbb{R}$. Assume $X$ is a random variable satisfying
$Eg(X) = EX$.

1. If $0 \le g'(x) \le 1$, $x \in \mathbb{R}$, then $g(X) \le_{\mathrm{cx}} X$.

2. If $g'(x) \ge 1$, $x \in \mathbb{R}$, then $X \le_{\mathrm{cx}} g(X)$.

Proof. We prove Part 1 only. Part 2 follows from Part 1 by considering the inverse function of
$g$. As $Eg(X) = EX$, one criterion for $g(X) \le_{\mathrm{cx}} X$ is

$$E\max\{0, g(X) - b\} \le E\max\{0, X - b\}, \quad b \in \mathbb{R}. \tag{30}$$

See, e.g., Shaked and Shanthikumar (2007, Theorem 3.A.1). Let us assume $0 \le g'(x) \le c$ for
some $0 < c < 1$. Otherwise we consider $cg(x)$ and let $c \uparrow 1$. As $g(x)$ is a contraction, it has a
unique fixed point, say $x_0$. Consider two cases.

Case (i): $b \ge x_0$. If $x \ge x_0$ then $g(x) - g(x_0) \le x - x_0$, i.e., $g(x) \le x$, and $\max\{0, g(x) - b\} \le
\max\{0, x - b\}$. If $x < x_0$ then $g(x) \le g(x_0) = x_0$ and

$$\max\{0, g(x) - b\} \le \max\{0, x_0 - b\} = 0 \le \max\{0, x - b\}.$$

In either case $\max\{0, g(x) - b\} \le \max\{0, x - b\}$, which implies (30).

Case (ii): $b < x_0$. Applying the argument of Case (i) to $\tilde g(x) = -g(-x)$ and $\tilde X = -X$ yields
$E\max\{0, b - g(X)\} \le E\max\{0, b - X\}$, which reduces to (30) because $Eg(X) = EX$. □
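
Part 1 of Lemma 3 can be illustrated on a finite sample, viewed as a uniform distribution on its points. The sketch below uses a hypothetical nonlinear map $g(x) = 0.5x + 0.3\sin x + \text{shift}$, whose derivative lies in $[0.2, 0.8]$, with the shift chosen so that $Eg(X) = EX$, and then compares the stop-loss transforms in criterion (30) over a grid of thresholds. The sample and the grid are arbitrary choices.

```python
import math


def stop_loss(values, b):
    """E max{0, V - b} when V is uniform on the given finite sample."""
    return sum(max(0.0, v - b) for v in values) / len(values)


def matched_contraction(sample):
    """Return g(x) = 0.5 x + 0.3 sin(x) + shift, with shift chosen so E g(X) = E X."""
    def raw(x):
        return 0.5 * x + 0.3 * math.sin(x)

    mean_x = sum(sample) / len(sample)
    mean_raw = sum(raw(x) for x in sample) / len(sample)
    shift = mean_x - mean_raw
    return lambda x: raw(x) + shift


sample = [-2.0, -0.5, 0.0, 0.3, 1.1, 2.6, 4.0]
g = matched_contraction(sample)
shrunk = [g(x) for x in sample]

# Criterion (30): the stop-loss transform of g(X) never exceeds that of X.
thresholds = [k / 4 for k in range(-12, 21)]
dominated = all(stop_loss(shrunk, b) <= stop_loss(sample, b) + 1e-12 for b in thresholds)
```

A finite grid of thresholds does not prove the convex ordering, but a violation at any threshold would contradict the lemma, so this serves as a sanity check.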

Proof of Theorem 4. We only prove Part 1; the second part is similar. The $n = 1$ case is easy.
For $n \ge 2$ we use induction. The equations (3)-(5) become

$$V_N(f_1; f_2; A_n) = \max\left\{V_N^1(f_1; f_2; A_n),\; V_N^2(f_1; f_2; A_n)\right\};$$
$$V_N^1(f_1; f_2; A_n) = a_1 \mu(f_1) + E\left[V_N(f_1^X; f_2; A_n^1) \mid \Phi f_1\right]; \tag{31}$$
$$V_N^2(f_1; f_2; A_n) = a_1 \mu(f_2) + E\left[V_N(f_1; f_2^Y; A_n^1) \mid \Phi f_2\right].$$

We denote the posterior $f_1^x(\theta) \propto f_1(\theta)\exp[-(x - \theta)^2/2]$; similarly for $f_2^y$. In $E[g(X) \mid \Phi f]$, the
density of $X$, denoted by $\Phi f$, is the convolution of $f$ with the standard normal. (Note the
difference from the notation in Section 2.) Let $m(x; f)$ denote the posterior mean of $\theta$ when $x$
is observed and the prior is $f$, i.e., $m(x; f) = \int \theta f^x(\theta)\, d\theta$. Direct calculation yields

$$\frac{\partial m(x; f)}{\partial x} = \mathrm{Var}(\theta \mid x). \tag{32}$$

That is, the derivative of $m(x; f)$ is simply the posterior variance of $\theta$.
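
The identity (32) can be checked numerically. The sketch below substitutes a discrete prior on a few points for the Lebesgue density used in the text (an assumption made only to keep the example short; the same calculation goes through with sums in place of integrals) and compares a central finite difference of the posterior mean with the posterior variance. The function name and the test values are illustrative.

```python
import math


def posterior_moments(x, thetas, weights):
    """Posterior mean and variance of theta after one N(theta, 1) observation x,
    for a discrete prior putting mass weights[k] > 0 on thetas[k]."""
    logs = [math.log(w) - 0.5 * (x - t) ** 2 for t, w in zip(thetas, weights)]
    top = max(logs)  # subtract the largest exponent for numerical stability
    post = [math.exp(v - top) for v in logs]
    total = sum(post)
    post = [p / total for p in post]
    m = sum(p * t for p, t in zip(post, thetas))
    var = sum(p * (t - m) ** 2 for p, t in zip(post, thetas))
    return m, var


thetas = [-1.0, 0.0, 2.0]
weights = [0.3, 0.5, 0.2]
h = 1e-5
gaps = []
for x in (-1.5, 0.0, 0.7, 3.0):
    m_plus, _ = posterior_moments(x + h, thetas, weights)
    m_minus, _ = posterior_moments(x - h, thetas, weights)
    _, var = posterior_moments(x, thetas, weights)
    # Central difference of m(x; f) versus the posterior variance in (32).
    gaps.append(abs((m_plus - m_minus) / (2 * h) - var))
```

Each entry of `gaps` should be close to zero, limited only by the finite-difference error.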
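Suppose $f_1 \le_{\mathrm{lc}} \tilde f_1 = \mathrm{N}(\alpha, 1/\tau)$ and $\mu(f_1) = \alpha$. Then

$$f_1^x \le_{\mathrm{lc}} \mathrm{N}\left(m(x; f_1), \frac{1}{\tau + 1}\right). \tag{33}$$

It can be shown that (i) if $X$ is distributed as $\Phi f_1$, then $m(X; f_1) \le_{\mathrm{cx}} (X + \tau\alpha)/(\tau + 1)$; (ii)
$\Phi f_1$ is smaller than $\Phi \tilde f_1 = \mathrm{N}(\alpha, 1 + 1/\tau)$ in the convex order. To prove (i), note that (33) holds
with $\le_{\mathrm{lc}}$ replaced by $\le_{\mathrm{cx}}$ as the two sides have equal means. By (32) we have

$$0 \le \frac{\partial m(x; f_1)}{\partial x} \le \frac{1}{\tau + 1}, \quad x \in \mathbb{R}.$$

If $X$ is distributed as $\Phi f_1$ then both $(X + \tau\alpha)/(\tau + 1)$ and $m(X; f_1)$ have mean $\mu(f_1) = \alpha$. Thus
claim (i) holds by Lemma 3. Claim (ii) holds because $f_1 \le_{\mathrm{cx}} \tilde f_1$ and the convex order is closed
under convolution.
We have

$$E\left[V_N(f_1^X; f_2; A_n^1) \mid \Phi f_1\right] \le E\left[V_N\left(\mathrm{N}\left(m(X; f_1), \frac{1}{\tau + 1}\right); f_2; A_n^1\right) \,\Big|\, \Phi f_1\right]$$
$$\le E\left[V_N\left(\mathrm{N}\left(\frac{X + \tau\alpha}{\tau + 1}, \frac{1}{\tau + 1}\right); f_2; A_n^1\right) \,\Big|\, \Phi f_1\right]$$
$$\le E\left[V_N(\tilde f_1^X; f_2; A_n^1) \mid \Phi \tilde f_1\right],$$

where the first inequality holds by (33) and the induction hypothesis, the second by claim (i)
noting

$$\tilde f_1^X = \mathrm{N}\left(\frac{X + \tau\alpha}{\tau + 1}, \frac{1}{\tau + 1}\right),$$

and the third by claim (ii). The last two inequalities also use the convexity of $V_N$ with respect to
the mean of a normal prior, i.e., Proposition 2. (Although Proposition 2 assumes normal priors
for both arms, this can be relaxed.) It follows from (31) that

$$V_N^1(f_1; f_2; A_n) \le V_N^1(\tilde f_1; f_2; A_n).$$

The rest of the proof is standard. □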



8 Discussion 

Results in previous sections suggest the following conjecture. Consider a two-armed bandit in
the general exponential family setting with conjugate priors. Suppose the prior expected yield
of one pull from each arm is the same, but the prior weight of arm 1 is larger. Then it seems
reasonable that arm 2 is optimal at the first stage, i.e., in the notation of Section 3,

$$\frac{\gamma_1}{\tau_1} = \frac{\gamma_2}{\tau_2} \ \text{ and } \ \tau_1 > \tau_2 \implies \Delta(\gamma_1, \tau_1; \gamma_2, \tau_2; A_n) \le 0.$$

This holds if the discount sequence is infinite-horizon geometric. Indeed, it is optimal to pull
arm 2 because, according to Corollary 2, arm 2 has a larger Gittins index. For non-geometric
discounting, we cannot apply Corollary 2 due to the lack of an index policy. In fact, Berry
(1972) proposed this conjecture for Bernoulli bandits with uniform discounting, and this special
case is still open.